2024-12-22
- Importance of human judgment and contextual knowledge
- High-quality data, compared with automated methods such as dictionaries
- Highly time-consuming and labor-intensive
- Often relies on a small sample of texts, which can be biased
## Getting labelled data {.smaller}
The goal is to learn a function that maps input text features to labels.
To do so, we train an algorithm to predict Y (labels) from X (text features), searching for the parameters that minimize prediction error.
A supervised classification model learns the relationship between input text features and labels and returns the probability that a text belongs to a category.
Classical models: Logistic regression, Naive Bayes, SVM, Random Forest
Transformer models (e.g., BERT)
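A minimal sketch of such a pipeline with scikit-learn, using a hypothetical toy dataset (the texts and labels below are illustrative, not real annotated data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: X = raw texts, Y = binary labels
texts = [
    "great movie, loved it", "wonderful acting and plot",
    "terrible film, waste of time", "boring and badly written",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Pipeline: text -> TF-IDF features -> logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# The fitted model returns a probability for each class
probs = model.predict_proba(["loved the plot"])
print(probs)
```

Any of the classical models listed above could be swapped in for `LogisticRegression` without changing the rest of the pipeline.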
Equation:

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

Accuracy is highly limited for imbalanced datasets.
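The limitation is easy to see on a hypothetical imbalanced test set: a classifier that always predicts the majority class still scores high accuracy.

```python
# Hypothetical test set: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A useless classifier that always predicts the majority class
y_pred = [0] * 100

# Accuracy = correct predictions / total predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95, despite never finding a single positive
```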
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive | False Negative |
| Actual Negative | False Positive | True Negative |
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]

\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]

\[ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
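The three metrics follow directly from the confusion-matrix counts. A worked example with hypothetical counts:

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fn, fp = 40, 10, 20

recall = tp / (tp + fn)                               # 40 / 50 = 0.8
precision = tp / (tp + fp)                            # 40 / 60 = 0.666...
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(round(recall, 3), round(precision, 3), round(f1, 3))
```

Because F1 is the harmonic mean of precision and recall, it stays low if either component is low, which makes it a more informative summary than accuracy on imbalanced data.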
## Supervised text classification